Composition and Decomposition of Japanese Katakana and Kanji Morphemes for Decision Rule Induction from Patent Documents
Authors
Abstract
We propose a new method for constructing a word list for rule induction from Japanese patent documents. Statistical morphological analyzers are used for Japanese word segmentation in many applications. However, their output is unreliable for unknown words, specifically words that contain Kanji/Katakana morphemes. Some words are over-segmented, which obscures their original meaning. Furthermore, the boundaries between compound nouns are uncertain, which impedes investigation in the initial stages of an application. In our method, we first perform morphological analysis to segment sentences into morphemes. Second, the segmented compound words are filtered by character type, and the Katakana/Kanji morphemes in the compound words are concatenated. Third, the concatenated morphemes are truncated to reduce verbosity. Finally, words comprising Katakana/Kanji are retained for use in a word list for rule induction. The experimental results show that our method is effective for extracting decision rules for patent classification.
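The concatenation and truncation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the sentence has already been segmented by a morphological analyzer and arrives as a list of surface strings. The function names, the Unicode ranges used to detect character types, and the truncation rule (keep only the last `max_morphemes` morphemes of each run) are all assumptions made for illustration; the abstract does not specify how truncation is performed.

```python
import re

# Assumed character classes: the Katakana block (which includes the long-vowel
# mark) and the basic CJK ideograph block plus the iteration mark U+3005.
KATAKANA = re.compile(r"^[\u30A0-\u30FF]+$")
KANJI = re.compile(r"^[\u4E00-\u9FFF\u3005]+$")


def char_type(morpheme):
    """Classify a morpheme as 'katakana', 'kanji', or 'other'."""
    if KATAKANA.match(morpheme):
        return "katakana"
    if KANJI.match(morpheme):
        return "kanji"
    return "other"


def build_word_list(morphemes, max_morphemes=3):
    """Join adjacent Katakana/Kanji morphemes into compound words.

    Runs longer than max_morphemes are truncated by keeping only the
    trailing morphemes (a hypothetical rule, chosen for this sketch).
    """
    words = []
    run = []
    # The empty-string sentinel is classified as 'other' and flushes the last run.
    for morpheme in list(morphemes) + [""]:
        if char_type(morpheme) != "other":
            run.append(morpheme)
        elif run:
            words.append("".join(run[-max_morphemes:]))
            run = []
    return words


# Hand-segmented example input (assumed analyzer output).
segmented = ["本", "発明", "は", "半導体", "レーザ", "装置", "に", "関する"]
full_words = build_word_list(segmented)
short_words = build_word_list(segmented, max_morphemes=2)
```

In this sketch, morphemes containing Hiragana (such as particles and inflected verbs) act as boundaries between compound words, so an over-segmented compound is rejoined into a single entry.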
Similar resources
The Effects of Word Frequency for Japanese Kana and Kanji Words in Naming and Lexical Decision: Can the Dual-Route Model Save the Lexical-Selection Account?
The effects of word frequency were examined for Japanese Kanji and Katakana words in 6 experiments. The sizes of frequency effects were comparable for Kanji and Katakana words in the standard lexical decision task. In the standard naming task, the frequency effect for Katakana words was significantly smaller than that for Kanji words. These results were consistent with the lexical-selection acc...
Keyboards for inputting Japanese language-arxiv
The most commonly used Japanese alphabets are Kanji, Hiragana and Katakana. The Kanji alphabet includes pictographs or ideographic characters that were adopted from the Chinese alphabet. Hiragana and Katakana are phonetic alphabets that do not include any characters common to each other or to Kanji. Hiragana is used to spell words of Japanese origin, while Katakana is used to spell words of wes...
Phonological-orthographic consistency for Japanese words and its impact on visual and auditory word recognition.
In most models of word processing, the degrees of consistency in the mappings between orthographic, phonological, and semantic representations are hypothesized to affect reading time. Following Hino, Miyamura, and Lupker's (2011) examination of the orthographic-phonological (O-P) and orthographic-semantic (O-S) consistency for 1,114 Japanese words (339 katakana and 775 kanji words), in the pres...
The relatedness-of-meaning effect for ambiguous words in lexical-decision tasks: when does relatedness matter?
Effects of the number of meanings (NOM) and the relatedness of those meanings (ROM) were examined for Japanese Katakana words using a lexical-decision task. In Experiment 1, only a NOM advantage was observed. In Experiment 2, the same Katakana words produced a ROM advantage when Kanji words and nonwords were added. Because the Kanji nonwords consisted of unrelated characters whereas the Kanji w...
Segmenting Sentences into Linky Strings Using D-bigram Statistics
Segmentation clearly plays an important role in natural language processing (NLP), especially for languages whose sentences are not easily separated into morphemes. In this study we propose a method of segmenting a sentence. The system described in this paper does not use any grammatical information or knowledge in processing. Instead, it uses statistical information drawn from n...